Bayesian multitask learning for medicine recommendation based on online patient reviews

Abstract

Motivation: We propose a drug recommendation model that integrates information from both structured data (patient demographic information) and unstructured texts (patient reviews). It is based on multitask learning to predict review ratings of several satisfaction-related measures for a given medicine, where related tasks can learn from each other for prediction. The learned models can then be applied to new patients for drug recommendation. This is fundamentally different from most recommender systems in e-commerce, which do not work well for new customers (referred to as the cold-start problem). To extract information from review texts, we employ both topic modeling and sentiment analysis. We further incorporate variable selection into the model via Bayesian LASSO, which aims to filter out irrelevant features. To the best of our knowledge, this is the first Bayesian multitask learning method for ordinal responses. We are also the first to apply multitask learning to medicine recommendation.

Results: We evaluate the proposed method on two sets of drug reviews involving 17 depression/high blood pressure-related drugs. Overall, our method performs better than existing benchmark methods in terms of accuracy and AUC (area under the receiver operating characteristic curve). It is effective even with a small sample size and only a few available features, and is more robust to possible noninformative covariates. Because of our model's explainability, insights generated from it may serve as a useful reference for doctors. In practice, however, a final decision should be made carefully by combining the information from the proposed recommender with doctors' domain knowledge and past experience.

Availability and implementation: The sample code and data are publicly available at GitHub: https://github.com/thrushcyc-github/BMull.


Introduction
In the USA, spending on prescription drugs is high and continues to increase. In 2020, the USA spent $4.1 trillion, or 19.7% of the gross domestic product, on national health expenditures, of which $348.4 billion was spent on prescription drugs (NHE Fact Sheet 2020). Over the next decade, the Centers for Medicare and Medicaid Services (CMS) projects that spending on retail prescription drugs will be the fastest-growing health category and will consistently outpace other health spending (https://www.iqvia.com/insights/the-iqviainstitute/reports/medicine-spending-and-affordability-in-the-us). To treat a given disease, there may exist a handful of drug options, each with its pros and cons. Different drugs might be more suitable for different patient groups. Recommending the drug that is best suited for a person diagnosed with a certain disease is a challenging yet important task (Bartlett et al. 2000).
With the fast development of data collection and analysis techniques for both structured and unstructured data, social media and online forums have become valuable data resources for such tasks. In this article, we collect drug review data from an online forum called WebMD, which is one of the most popular sites for patient reviews. For each reviewer (patient), we have star ratings on three measures (satisfaction level, effectiveness, and ease of use), age, gender, treatment time, and the review text. The collected information may enable us to obtain basic information about patients' health status as well as key characteristics of each drug, all of which constitute an informative knowledge base toward better medicine recommendation.
Consistent with the recent research on data mining using social media data, we develop and evaluate a novel predictive framework in the context of medicine recommendation. In particular, we propose a Bayesian multitask learning (BMul) approach to predict patients' ratings of three satisfaction-related measures for a given medicine. From an application perspective, the proposed approach makes several contributions. Firstly, unlike the existing approaches that might depend on very detailed medical and genetic information, which is usually not easy to obtain, our approach can provide personalized recommendation by making use of commonly available information upon a patient's first visit. Such an information-light approach may greatly broaden the applicability of medicine recommendation and benefit both patients and doctors. Secondly, existing methods often focus on one measure of each medicine, and thus ignore the underlying connection among the different measures and some patients' specific emphasis on certain measure(s) over the others. The proposed approach combines the prediction of related measures into a unified framework, which allows the models for different measures to "learn from each other" and achieve better performance. Thirdly, by extracting drug features from patient review contents using topic modeling and formally incorporating variable selection through the Bayesian LASSO (Park and Casella 2008) to filter irrelevant topics, our approach can utilize text information efficiently and further make recommendations by considering a patient's personal preference about treatment and side effects in addition to his health status. From a methodological perspective, we propose a BMul approach for ordinal regression with a variable selection component using a Laplace prior. Based on a latent variable view of the ordinal probit model, efficient sampling algorithms are implemented via a Gibbs sampler for both the original BMul model and the one with a shrinkage prior. To the best of our knowledge, this is the first BMul method designed for ordinal response variables.
We compare the proposed method (BMul) with several popular classification methods and show the advantage of BMul. In particular, we show that BMul has higher predictive power as compared to the competing methods, especially when the sample size is small. Thus, BMul can be used for new drugs with a limited number of reviews.

Medicine recommendation
Recent developments on medicine recommendation can be categorized into two classes according to their different setups. The first class assumes that the type of disease is known and aims to recommend a single drug that is deemed as the most appropriate treatment based on some prediction method, out of a list of drugs for treating a given health condition. This is similar to our problem setting. Palanivinayagam and Sasikumar (2020) proposed a drug recommendation system aiming at minimizing potential side effects. Balvert et al. (2019) conducted personalized drug prescription based on information from cancer cell lines by selecting the prediction model with the best performance for each drug. In a recent review paper, Romagnoli et al. (2017) discussed the information needed for making clinical recommendations about potential drug-drug interactions. More closely related to our problem setting, Rao et al. (2020) proposed a medicine recommendation system based on patient reviews. However, their method is focused on finding the best drug for a given condition without personalizing the recommendation based on individual patients' information; i.e. their recommendation is at the condition level while our method moves one step further to the patient level.
The second class of methods aims to recommend a list of candidate drugs by identifying patients' symptoms. This class of methods is closely related to the recommendation system widely applied in E-commerce, which addresses the information overload issue by identifying customers with similar behavior patterns and then recommending similar items for potential purchase. Such recommender systems often focus on identifying patient or drug cohorts via clustering and utilize patient-drug interactions, and thus do not apply to our problem setting, due to the well-known cold-start issue (Bobadilla et al. 2013). Sun et al. (2016) utilized electronic medical records to assign patients into patient cohorts. Then, the typical treatment regimen would be estimated for each patient cohort and recommended for incoming patients classified to the same patient cohort. Zhang et al. (2015) proposed a hybrid framework to recommend drugs by matching a new patient's information with a certain diagnosis. Then, a drug cluster would be recommended based on a pretrained symptom-drug classifier. Readers can refer to Stark et al. (2019) for a more detailed overview of this class of medicine recommendation systems.
Due to the differences in problem setting and information needed, the prediction-based methods (i.e. the first class of methods rather than the second class) will be compared with our proposed method in our data analyses.

Multitask learning
When dealing with multiple related or similar modeling tasks, the model performance can be improved by making the tasks "learn from each other." Such methods are usually referred to as multitask learning. Most of the existing multitask learning methods were proposed by researchers in areas such as data mining, statistics, and computer science. Those methods can be classified into two large groups. One group works by assuming a certain structure on the parameters across different tasks to enforce commonality or similarity. Among the earliest attempts, Liao and Carin (2005) and Silver et al. (2008) proposed to use a feed-forward neural network (NN) to learn the best feature transformation by determining the structure of a hidden layer. Obozinski et al. (2006, 2010) are among the first to conduct feature selection in the multitask learning setting. Another group of methods focuses on learning the relationships among related tasks. Xue et al. (2007) proposed to use a Dirichlet process prior to conduct task clustering. Recently, researchers have started to apply multitask learning techniques to healthcare-related problems. More related to our study, Lin et al. (2017) proposed a BMul approach for risk profiling in a chronic care setting. For a more detailed review of multitask learning, readers can refer to Zhang and Yang (2022). To the best of our knowledge, our work is the first multitask learning model proposed for ordinal response variables. Furthermore, we are the first to apply the proposed framework to build a medicine recommendation system.

Data description and system design
We collect patient review data from WebMD.com, a widely used platform for sharing information pertaining to human health and well-being. For a given drug, from each review we have information about the star ratings (1-5) on three measures: satisfaction level, effectiveness, and ease of use; we also have patient information (gender, age group, time since treatment) and the actual review text. In order to capture the main characteristics of the drug reflected through the review contents, we use topic modeling to extract a number of relevant topics from the text information; and for each patient, we create a vector of binary variables to indicate whether each selected topic is covered or not in the review. Details about topic modeling will be discussed in Section 5.
For an incoming patient with a certain health condition, we aim to predict his star ratings on three aspects (satisfaction, effectiveness, and ease of use) of various drugs available for treatment. Using the prediction scores of any aspect, we can generate a ranked list of drugs. Alternatively, an overall preference score, which combines the three aspects, can be used for ranking. To provide final drug recommendations, it is crucial to integrate doctors' domain knowledge with the predicted rankings. An overview of the design of the proposed medicine recommendation system is presented in Fig. 1. Note that we would like to use the proposed method for new patients whose review contents may not be available. Thus, we make use of the text information collected from existing patients only to extract general drug characteristics. In practice, a questionnaire can be designed and then distributed to new patients during their first visits, to collect information about whether they are concerned with/interested in these characteristics. The information will be supplied along with other patient information such as age and gender to the proposed recommendation system for prediction. In case such information is missing, we can impute the corresponding values using the most frequently occurring values from existing patients for the drug.
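When such questionnaire answers are missing for a new patient, the fallback described above is simple mode imputation from existing reviews of the same drug. A minimal sketch (pandas assumed; column names are hypothetical):

```python
import pandas as pd

def impute_topic_indicators(new_patient: pd.Series, existing_reviews: pd.DataFrame,
                            topic_cols: list) -> pd.Series:
    """Fill missing topic-indicator answers with the most frequently occurring
    value among existing reviews of the same drug (mode imputation)."""
    filled = new_patient.copy()
    for col in topic_cols:
        if pd.isna(filled[col]):
            filled[col] = existing_reviews[col].mode().iloc[0]
    return filled
```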
For illustrative purposes, we focus on two sets of drugs in our analyses, one for treating depression and the other for high blood pressure, for which the descriptive statistics are given in Tables 1 and 2, respectively. In the data sets, we have information about patients' gender, age group, and treatment group besides their ratings. The age group is coded as 1 (age 3-6), 2 (age 7-12), 3 (age 13-18), 4 (age 19-24), 5 (age 25-34), 6 (age 35-44), 7 (age 45-54), 8 (age 55-64), 9 (age 65-74), and 10 (age 75 or over). The treatment group is coded as 1 (on treatment for < 1 month), 2 (on treatment for 1-6 months), 3 (on treatment for 6-12 months), 4 (on treatment for 1-2 years), 5 (on treatment for 2-10 years), and 6 (on treatment for 10 years or more). An interesting observation from the tables is that, for all the drugs, the mean ratings for satisfaction level are lower than both means for ease of use and effectiveness, suggesting that ease of use and effectiveness are not the only factors that affect patients' satisfaction. Information such as side effects may be helpful, which can be reflected in some patients' text reviews.
For each drug, we examine the correlation structure among the three ratings. The correlations for the 11 depression-related drugs and the 6 high blood pressure-related drugs are shown in Table 3. We can see that the three measures are positively correlated for all the drugs, with a majority of the correlations above 0.5. Also, it is interesting to see that the highest correlations are between satisfaction and effectiveness, approximately 0.9 for depression-related drugs and approximately 0.75 for high blood pressure-related drugs. These findings show that the three tasks are indeed related, suggesting that a multitask learning framework might be helpful in improving the predictive performance.

Ordinal regression and the latent variable view
We first give a brief overview of the base model we use for predicting star ratings. This serves as a major building block for the BMul model that will be introduced in the next subsection. Note that, throughout the article, the prediction is done separately for each drug. For a specific drug, let $y_{ij} \in \{1, 2, \ldots, R = 5\}$ be the rating of the $i$th reviewer (patient) for measure $j$, $i = 1, \ldots, n$, $j = 1, \ldots, J = 3$, where $n$ is the total number of reviewers available for that drug. A generalized linear regression model for an ordinal response variable with a link function $g$ assumes the following form:
$$g\big(P(y_{ij} \le r)\big) = \alpha_{r,j} - \boldsymbol{x}_{ij}^T \boldsymbol{\beta}_j, \quad r = 1, \ldots, R-1, \qquad (1)$$
where $\boldsymbol{x}_{ij}$ is a column vector of $p$ covariates; for each measure $j$, $\alpha_{r,j}$ is the intercept that depends on $r$ and satisfies $\alpha_{1,j} < \alpha_{2,j} < \cdots < \alpha_{R-1,j}$, and $\boldsymbol{\beta}_j = (\beta_{1j}, \ldots, \beta_{pj})^T$ are the regression coefficients that are assumed common for all $r$ values. In particular, for reviewer $i$, we set the covariates to be: gender (0 = Female, 1 = Male), age group (recoded to 1-9), treatment length, the quadratic terms of age group and treatment length, and indicators for which topics are covered in the review. There exists a latent variable view of model (1), which allows us to easily incorporate it into a Bayesian framework (Albert and Chib 1993). Let $y_{ij}^*$ be an underlying latent variable with
$$y_{ij}^* = \boldsymbol{x}_{ij}^T \boldsymbol{\beta}_j + \epsilon_{ij}, \quad \epsilon_{ij} \sim G,$$
where $G$ is the cumulative density function (CDF) of a user-specified distribution. Assume the observed response $y_{ij}$ depends on the unobserved variable $y_{ij}^*$ through the following relationship:
$$y_{ij} = r \quad \text{if} \quad \alpha_{r-1,j} < y_{ij}^* \le \alpha_{r,j}, \quad r = 1, \ldots, R,$$
with $\alpha_{0,j} = -\infty$ and $\alpha_{R,j} = +\infty$. If $G$ is the CDF of a standard logistic distribution, then the resulting model is an ordinal logistic regression model. For an efficient and easy implementation of the Bayesian method introduced in subsequent subsections, we set $G$ to be the CDF of a standard normal distribution $\Phi$, which is equivalent to assuming $\epsilon_{ij} \sim N(0, 1)$. We will refer to the above model as the ordinal probit model (OP), one of the baseline models used for performance comparison.
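For concreteness, the latent-variable view leads directly to a data-augmentation Gibbs sampler. Below is a minimal single-task sketch in the spirit of Albert and Chib (1993), with a N(0, 10I) prior on the coefficients and flat priors on the cutpoints; it illustrates the mechanics only and is not the multitask sampler (with truncated normal priors on the cutpoints) developed in the following subsections.

```python
import numpy as np
from scipy.stats import truncnorm

def ordinal_probit_gibbs(X, y, R=5, n_iter=2000, seed=0):
    """Single-task ordinal probit via Albert-Chib data augmentation.
    X: (n, p) covariate matrix; y: (n,) integer ratings in {1, ..., R}.
    Assumes every category 1..R appears at least once in y."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=int)
    n, p = X.shape
    beta = np.zeros(p)
    # alpha[0] = -inf and alpha[R] = +inf bracket the R-1 interior cutpoints.
    alpha = np.concatenate(([-np.inf], np.linspace(-1.0, 1.0, R - 1), [np.inf]))
    B0_inv = np.eye(p) / 10.0          # precision of the N(0, 10 I) prior on beta
    XtX = X.T @ X
    beta_draws = []
    for _ in range(n_iter):
        # 1. Latent variables: y*_i ~ N(x_i' beta, 1) truncated to (alpha_{r-1}, alpha_r].
        mu = X @ beta
        lo, hi = alpha[y - 1], alpha[y]
        y_star = truncnorm.rvs(lo - mu, hi - mu, loc=mu, scale=1.0, random_state=rng)
        # 2. Regression coefficients: conjugate multivariate normal full conditional.
        V = np.linalg.inv(B0_inv + XtX)
        beta = rng.multivariate_normal(V @ (X.T @ y_star), V)
        # 3. Cutpoints: uniform between adjacent latent values (flat-prior version;
        #    the paper instead uses truncated normal priors on the cutpoints).
        for r in range(1, R):
            lower = max(y_star[y == r].max(), alpha[r - 1])
            upper = min(y_star[y == r + 1].min(), alpha[r + 1])
            alpha[r] = rng.uniform(lower, upper)
        beta_draws.append(beta.copy())
    return np.array(beta_draws), alpha[1:R]
```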

The BMul model
To account for the fact that the star ratings on different measures are for the same drug and from the same reviewer, we assume that the prior of $\boldsymbol{\beta}_{k\cdot} = (\beta_{k1}, \ldots, \beta_{kJ})^T$ for the $k$th covariate follows a multivariate normal distribution,
$$\boldsymbol{\beta}_{k\cdot} \sim N(\boldsymbol{0}, \Sigma_k).$$
We further assign a hyperprior distribution to each $\Sigma_k$: $\Sigma_k \sim W^{-1}(W_k, \nu)$, which is an inverse Wishart distribution with hyperparameters $W_k$ (a $J \times J$ scale matrix) and $\nu > J - 1$ (the degrees of freedom). In this way, although the parameters of the same covariate for the $J$ measures are different, they are set to learn from each other through the dependent multivariate normal prior and the corresponding hierarchical Bayes structure.
We assume the prior distributions of $\boldsymbol{\alpha}_j = (\alpha_{1,j}, \ldots, \alpha_{R-1,j})^T$ are $\alpha_{1,j} \sim N(0, 10)$ and $\alpha_{r,j} \mid \alpha_{r-1,j} \sim N(0, 10)$ left truncated at $\alpha_{r-1,j}$, $r = 2, \ldots, R-1$. Let $\boldsymbol{\alpha} = (\alpha_{1,1}, \ldots, \alpha_{R-1,1}, \ldots, \alpha_{1,J}, \ldots, \alpha_{R-1,J})^T$. Then the joint posterior distribution of all parameters and latent variables is proportional to the product of the likelihood of $\boldsymbol{y}$ given $\boldsymbol{y}^*$ and $\boldsymbol{\alpha}$, the conditional density of $\boldsymbol{y}^*$ given $\boldsymbol{\beta}$ and $X$, and the priors on $\boldsymbol{\beta}$, $\boldsymbol{\alpha}$, and the $\Sigma_k$'s, where $\boldsymbol{y} = (y_{11}, \ldots, y_{1J}, \ldots, y_{n1}, \ldots, y_{nJ})^T$ is a vector of size $nJ$, $\boldsymbol{y}^* = (y_{11}^*, \ldots, y_{1J}^*, \ldots, y_{n1}^*, \ldots, y_{nJ}^*)^T$, and $X$ is the corresponding $pJ \times nJ$ covariate matrix, with $x_{ijk}$ being the $k$th element of $\boldsymbol{x}_{ij}$ (i.e. the value of covariate $k$ for reviewer $i$ and measure $j$). We note that in our application, for any patient $i$, the covariate vectors $\boldsymbol{x}_{ij}$ for the different star-rating measures are the same, as the information available for each measure is the same. That is, we have $x_{ijk} \equiv x_{ij'k}$ for all $j, j'$, such that each $J \times J$ submatrix of $X$ has identical diagonal components. However, throughout the article, for a general model set-up, we use $\boldsymbol{x}_{ij}$ rather than $\boldsymbol{x}_i$ to allow the covariate vectors for different measures to differ. Later in the variable selection stage, we allow different variables to be selected for different measures, and using $\boldsymbol{x}_{ij}$ makes this explicit.

A Simplified BMul model
One simplification of the BMul model is to set $\boldsymbol{\alpha}_j$ to be the same for different $j$ in (1), i.e. $\boldsymbol{\alpha}_j \equiv \boldsymbol{\alpha}$. This assumption is plausible in our application because the three measures are for the same drug and on the same scale (from 1 to 5), and so the intercepts might not vary much across these measures. Under this assumption, the joint posterior distribution simplifies accordingly, with a single set of intercepts $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_{R-1})^T$ shared by all $J$ measures. To examine how this parsimony assumption would affect the performance, we conduct a small simulation study to compare the two versions (same intercepts versus different intercepts across $j$). We use data from the first three depression-related drugs (i.e. Drugs 1-3 in Table 1) as examples, and for each drug we summarize results in Table 4 based on 100 repetitions, where each repetition randomly selects 50 patients for training and then uses the remaining patients for testing (for more detail, see the beginning of Section 5). Columns 2-5 show the prediction accuracy and AUC (area under the receiver operating characteristic curve) in percentages for running a 5-class model (i.e. each $y_{ij}$ has five levels, one star to five stars), and Columns 6-9 for a 2-class model (i.e. each $y_{ij}$ has two levels, $<3$ versus $\ge 3$ stars, which can be interpreted as "not recommend" versus "recommend"). For each model, the first two columns show results for assuming $\boldsymbol{\alpha}_j \equiv \boldsymbol{\alpha}$ and the latter two for $\boldsymbol{\alpha}_j \ne \boldsymbol{\alpha}$. Unless specified otherwise, the accuracy and AUC values are calculated from the test data by pooling all three measures throughout the article; for both 5-class and 2-class BMul models, they are computed using the predicted probability of "recommend," and the cutoff for computing the accuracy is the default value 0.5. The rows show results for different drugs. We find that assuming the same intercepts across different $j$ leads to comparable results, both in terms of accuracy and AUC. Therefore, the simplified BMul with constant intercepts will be used for the rest of the article. In addition, since the 5-class BMul performs slightly better than the 2-class BMul, the 5-class label is used in the training stage for the rest of the article.

BMul with LASSO for variable selection
In many applications, not all the covariates are useful for the prediction task. In our context, we conjecture that at least some topic indicator variables generated from the review contents might be irrelevant. Further, hospitals and other medical services can have much more information about individual patients. Thus, we introduce a modified BMul model with an added hierarchical structure for the purpose of variable selection, which helps to select important variables among a potentially large pool of candidate variables. To encourage the sparsity of $\boldsymbol{\beta}$, we introduce a shrinkage parameter $s_k^2$ that scales $\Sigma_k$, the covariance matrix of $\boldsymbol{\beta}_{k\cdot}$. That is, the prior on $\boldsymbol{\beta}_{k\cdot}$ becomes $N(\boldsymbol{0}, s_k^2 \Sigma_k)$, with an exponential hyperprior on $s_k^2$ whose rate involves $\lambda$, a hyperparameter that can be estimated using an empirical Bayesian approach. Such a hierarchical structure is motivated by the seminal work on the Bayesian LASSO proposed by Park and Casella (2008). A diagram of our modified Bayesian model with LASSO is shown in Fig. 2, where the rectangular-shaped items denote observed variables and the oval-shaped items denote either parameters or hidden variables.
Letting $\boldsymbol{s}^2 = (s_1^2, \ldots, s_p^2)^T$, the conditional posterior distributions of $\boldsymbol{\beta}_{k\cdot}$, $s_k^2$, and $\Sigma_k$ remain available in closed form; in particular, $1/s_k^2$ given all the other quantities ($\boldsymbol{s}_{-k}^2$, $\boldsymbol{y}^*$, $\boldsymbol{y}$, $\boldsymbol{\alpha}$, $\boldsymbol{\beta}$) follows an inverse-Gaussian (IG) distribution, while the conditional posterior distributions of $\boldsymbol{\alpha}$ and $\boldsymbol{y}^*$ remain unchanged. The estimate of the hyperparameter $\lambda$ is obtained via the empirical Bayesian approach mentioned above.
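For intuition, the corresponding update in the scalar Bayesian LASSO of Park and Casella (2008) can be sketched as follows; the exact parameterization of the multitask version used here differs, so this is illustrative only (it assumes the standard scalar-coefficient form with unit error variance).

```python
import numpy as np

def sample_inv_s2(beta_k, lam, rng=None):
    """Scalar Park-Casella style update: 1/s_k^2 | rest follows an inverse-Gaussian
    distribution with mean lambda/|beta_k| and shape lambda^2 (assuming unit error
    variance). numpy's Wald distribution is the inverse Gaussian with (mean, scale)."""
    if rng is None:
        rng = np.random.default_rng()
    return rng.wald(lam / abs(beta_k), lam ** 2)
```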

Algorithm implementation and model comparison
Since we set all the priors to be conjugate, as shown in the previous subsections, the conditional posterior distribution of each parameter follows a known distribution. As a result, sampling from the posterior can be easily achieved using Gibbs sampling (Geman and Geman 1984). For all the analyses in the article, we use Geweke's Diagnostics (implemented in R package "coda," Version: 0.19-4) to check the convergence and use the sample means of the second halves of the chains as the estimates.
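For readers working in Python rather than R, a simplified analogue of the convergence check (naive variance estimates instead of coda's spectral-density estimates) and the second-half point estimate can be sketched as below.

```python
import numpy as np

def geweke_z(chain, first=0.1, last=0.5):
    """Simplified Geweke diagnostic: z-score comparing the mean of the first 10%
    of the chain with the mean of the last 50%. coda's geweke.diag uses spectral
    density estimates of the variances; plain sample variances are used here."""
    chain = np.asarray(chain)
    a = chain[: int(first * len(chain))]
    b = chain[-int(last * len(chain)):]
    return (a.mean() - b.mean()) / np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))

def posterior_estimate(chain):
    """Point estimate used in the article: mean of the second half of the chain."""
    chain = np.asarray(chain)
    return chain[len(chain) // 2:].mean(axis=0)
```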
To examine how the (modified) BMul works in practice, we use the first two depression-related drugs and the first two high blood pressure-related drugs as examples, run 100 repetitions for each drug, and compare the AUC values from the BMul methods with and without the variable selection component (i.e. BMul_lasso, BMul_topic, and BMul_non). Here, BMul_lasso stands for BMul with LASSO for variable selection, BMul_topic for BMul using all the topic-related variables, and BMul_non for BMul excluding all the topic-related variables. Neither BMul_topic nor BMul_non employs variable selection. The AUC values are calculated using the probability of "recommend" as in Table 4, and their boxplots are shown in Fig. 3 by drug.
In situations where some or all variables from topic modeling contain useful information for predicting the star ratings, we expect BMul_topic to perform better than BMul_non; otherwise, BMul_topic would perform comparably to or even worse than BMul_non, due to the inclusion of many irrelevant variables. Figure 3 shows that the four drugs fall into these different situations, and that BMul_lasso mimics the better of the two when BMul_topic and BMul_non perform differently. For depression drug 2, BMul_topic performs better than BMul_non. We can see that BMul_lasso is able to select useful topics and thus produces AUC closer to that of BMul_topic. On the other hand, BMul_topic performs worse than BMul_non for high blood pressure drug 1. In this case, BMul_lasso is able to remove the majority of the topics and thus produces AUC similar to that of BMul_non. For depression drug 1 and high blood pressure drug 2, the performance of BMul_topic and BMul_non is similar, and so is that of BMul_lasso. These results suggest that BMul_lasso serves the purpose of variable selection and is effective in selecting the relevant variables (if any) and excluding the irrelevant ones. Thus, BMul_lasso is a better choice, as we typically do not know in advance whether topic modeling can help create variables that predict star ratings for a specific drug.
To test the significance of the differences between the methods with and without feature selection, we conduct a two-way ANOVA for the logit-transformed AUC with two factors (method and drug), in which BMul_lasso serves as the baseline method. The negative coefficients for BMul_non and BMul_topic in Table 5 show that BMul_lasso outperforms both. Further, the differences are statistically significant at the 5% significance level.
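A sketch of this comparison with statsmodels, assuming a long-format data frame with one AUC value per repetition (column names are hypothetical):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def compare_methods(df: pd.DataFrame):
    """Two-way ANOVA (linear model) for the logit-transformed AUC with factors
    method and drug, using BMul_lasso as the reference level for method.
    Assumed columns: 'auc' in (0, 1), 'method', 'drug'."""
    df = df.copy()
    df["logit_auc"] = np.log(df["auc"] / (1 - df["auc"]))
    model = smf.ols(
        "logit_auc ~ C(method, Treatment(reference='BMul_lasso')) + C(drug)", data=df
    ).fit()
    return model.summary()
```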
We refer to BMul with LASSO for variable selection as BMull hereafter. For the performance comparison in Section 5, we will use the BMull version.

Method limitations
The proposed methodology offers an effective means of predicting patients' star ratings on three different aspects of a drug simultaneously. In the field of recommender research, there are many proposed methods for combining criteria-specific ratings into an overall rating. Simple approaches include weighted averages or additive methods based on utility functions (Lakiotaki et al. 2008). More advanced techniques use probability distributions to model user preferences across criteria, such as probabilistic latent semantic analysis (Zhang et al. 2009). Multiple-criteria decision analysis can also be used to consolidate the different scores (Lakiotaki et al. 2011). However, all these methods require data containing user-assigned overall ratings, which serve as supervision to guide the aggregation process. Such overall ratings are unavailable in our WebMD.com datasets. Furthermore, the recommendation of drugs should consider both doctors' domain knowledge and patients' ratings, usually incorporated through a utility function, which is not available in our study. In real-world scenarios, one can always take the average of predicted multicriteria ratings, but this may often be insufficient. It is important to highlight that doctors have the ability to leverage patient-specific information, along with their domain knowledge, to make recommendations based on individual patient preferences across the three criteria. This personalized approach can significantly enhance the accuracy and relevance of drug recommendations. Unfortunately, due to the unavailability of the sensitive and personalized data required for such analysis, coupled with the absence of overall rating data and a utility function, we were unable to conduct an experiment to demonstrate the real-world effectiveness of our methodology. We recognize the value of incorporating patient-specific information and data into the recommendation process, but it is beyond the scope of our current study.
Finally, we emphasize that the proposed methodology is intended to be a useful source of information for doctors, rather than being directly used by patients. Caution should be exercised as we are limited by publicly available features and data quality. However, the transparency and explainability of our method enable doctors not only to resort to the predicted ratings but also to understand how the ratings are generated. Further, since it is a Bayesian method, the prediction interval is a natural by-product of posterior sampling, which can help gauge the prediction uncertainty and raise better awareness about the limitations of the output.

Data analysis results
We apply the proposed BMull (Bayesian multitask learning with LASSO) and a list of competitors containing popular classification algorithms to predict patient ratings for different drugs. The list includes the OP, support vector machine (SVM), random forests (RF), NN and AdaBoost (Ada). For each of these existing methods, we consider two versions, with and without using the topic information (_topic versus _non). We report results based on averages over 100 repetitions for each drug. For each repetition, we randomly sample 50 patients and use the corresponding 150 observations (three measures for each patient) to form a training set, and all the remaining observations to form a test set. The accuracy and AUC measures are calculated using the test set. For all methods, the 5-class label is used in the training stage, and the probability of "recommend" (i.e. three stars or above) is used for computing the accuracy (using the cutoff value 0.5) and AUC values in the testing stage. In what follows, we first describe the data processing steps and then report the analysis results for performance evaluation.
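One repetition of this evaluation protocol can be sketched as follows; fit_and_predict is a hypothetical wrapper around whichever method is being evaluated and is assumed to return the predicted probability of "recommend" (three stars or above) for each test observation.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

def evaluate_once(patient_ids, X, y, fit_and_predict, n_train=50, seed=0):
    """Randomly sample 50 patients (150 observations) for training, test on the
    rest, and compute accuracy (cutoff 0.5) and AUC for the binary 'recommend'
    (>= 3 stars) outcome derived from the 5-class predictions."""
    rng = np.random.default_rng(seed)
    train_patients = rng.choice(np.unique(patient_ids), size=n_train, replace=False)
    tr = np.isin(patient_ids, train_patients)
    p_rec = fit_and_predict(X[tr], y[tr], X[~tr])     # predicted P(>= 3 stars)
    y_bin = (y[~tr] >= 3).astype(int)
    return accuracy_score(y_bin, (p_rec >= 0.5).astype(int)), roc_auc_score(y_bin, p_rec)
```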

Data preprocessing
As described earlier, in addition to the numeric values describing each patient's demographics, we have the actual review content from each patient. We use topic modeling to extract the commonly mentioned features of each drug and use them as part of our knowledge base when recommending a drug to new patients. For some of the drugs, the number of reviews is limited and topic modeling may not produce satisfactory results. Thus, we augment our review samples by including reviews of the same drugs from a Kaggle data set (available at https://www.kaggle.com/jessicali9530/kuc-hackathon-winter-2018). Using the augmented review samples, we conduct topic modeling (Blei et al. 2003) for each drug by setting the number of topics to be 4, 7, 10, 13, 16, and 19, and use the coherence score of each model as the criterion for selecting the best topic number. Here, each topic is typically represented as a set of important (or top-ranked) words, and the coherence score of a model measures how semantically similar the words within each topic are. Therefore, a higher coherence score means tighter clusters for a model. Further technical details about topic modeling and the preprocessing steps are included in the Supplementary Material.
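A sketch of the topic-number selection step with gensim, assuming the reviews have already been tokenized and preprocessed; the specific coherence measure ("c_v" below) and function names are illustrative assumptions.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.models.coherencemodel import CoherenceModel

def select_topic_number(texts, candidates=(4, 7, 10, 13, 16, 19), seed=0):
    """Fit one LDA model per candidate topic number and keep the model with the
    highest coherence score. `texts` is a list of token lists, one per review."""
    dictionary = Dictionary(texts)
    corpus = [dictionary.doc2bow(tokens) for tokens in texts]
    best = None
    for k in candidates:
        lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k, random_state=seed)
        score = CoherenceModel(model=lda, texts=texts, dictionary=dictionary,
                               coherence="c_v").get_coherence()
        if best is None or score > best[0]:
            best = (score, k, lda)
    return best  # (coherence score, selected number of topics, fitted model)
```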
As the output of topic modeling, for each review we obtain a probability vector $\boldsymbol{p} = (p_1, \ldots, p_K)$, where $K$ is the total number of topics selected based on the coherence score, and $p_k$ is a probability measuring how likely this review is related to topic $k$, satisfying $\sum_{k=1}^{K} p_k = 1$. Then, we use $1/K$ as the cutoff to create the corresponding topic indicator. For example, if $\boldsymbol{p} = (0.1, 0.02, 0.7, 0.18)$ for a review, then the cutoff is 0.25 and the topic indicator vector is $(0, 0, 1, 0)$. That is, this review is related to topic 3. Note that the indicator vector can have multiple elements being 1, so a review can be related to multiple topics. For each review, we further introduce the idea of the dominant topic, defined as the topic with the highest probability. In our previous example, the probability vector is $(0.1, 0.02, 0.7, 0.18)$ for topics 1-4, respectively. The highest probability, 0.7, corresponds to topic 3. Thus, the dominant topic is 3 for this review.
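The indicator and dominant-topic construction is a direct thresholding of the topic-probability vector; a minimal sketch (1-indexed topics, as in the example above):

```python
import numpy as np

def topic_features(p):
    """Given a review's topic-probability vector p (summing to 1), return the
    binary topic indicators (p_k > 1/K) and the dominant topic (1-indexed)."""
    p = np.asarray(p, dtype=float)
    indicators = (p > 1.0 / len(p)).astype(int)
    dominant = int(np.argmax(p)) + 1
    return indicators, dominant

# Example from the text: topic_features([0.1, 0.02, 0.7, 0.18])
# returns (array([0, 0, 1, 0]), 3).
```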
In addition to the topic indicators, we also extract a sentiment score (ranging from -1 to 1) for each review, using the TextBlob package in Python. A positive/negative sign of the score reflects a positive/negative sentiment, while its magnitude reflects the strength of the sentiment. To incorporate this information into the analysis, we create a variable called prior_sentiment. For a review, its prior_sentiment is defined as the average sentiment score of the other reviews that occur prior to it and have the same dominant topic. Obviously, the first review for each topic will not have any prior reviews, so we set the prior_sentiment to 0 for such reviews. This sentiment variable and the topic indicators are included in the model for the topic version of each method.
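A sketch of the sentiment feature construction with TextBlob, assuming the reviews are ordered chronologically and the dominant topic has already been computed (column names are hypothetical):

```python
import pandas as pd
from textblob import TextBlob

def add_sentiment_features(reviews: pd.DataFrame) -> pd.DataFrame:
    """reviews: chronologically ordered data frame with columns 'text' and
    'dominant_topic'. Adds 'sentiment' (TextBlob polarity in [-1, 1]) and
    'prior_sentiment' (mean sentiment of earlier reviews sharing the same
    dominant topic; 0 when no earlier review exists)."""
    out = reviews.copy()
    out["sentiment"] = out["text"].apply(lambda t: TextBlob(t).sentiment.polarity)
    prior = []
    for i, topic in enumerate(out["dominant_topic"]):
        earlier = out.iloc[:i]
        same = earlier.loc[earlier["dominant_topic"] == topic, "sentiment"]
        prior.append(same.mean() if len(same) else 0.0)
    out["prior_sentiment"] = prior
    return out
```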

Comparison of predictive performance
We first show accuracy and AUC results for the first two depression-related drugs and the first two high blood pressure-related drugs in Table 6, where for each evaluation criterion, the best/worst number is highlighted in bold/italic font. The results for other drugs follow similar patterns and so are omitted for brevity.
In terms of both AUC and accuracy, BMull performs the best or close to the best. This indicates that the proposed method can effectively borrow information across tasks and therefore achieve better predictive performance. We observe from Table 6 that topic information is not always helpful in boosting the performance of the existing methods. The same observation is shown in Fig. 3 for the proposed method. For example, for depression-related drug 1, using the topic information leads to worse AUC for all the existing methods (the AUC decreases by 1.2 to 5.0). On the other hand, for depression-related drug 2, the topic information lifts the performance of RF, SVM, and Ada but reduces the performance of OP and NN. Overall, we see that adding the topic information negatively affects the performance of OP the most, which is probably due to the overfitting issue. We further note that conducting variable selection for the existing methods might not change the best numbers in Table 6, as doing so would lead to performance somewhere between the two versions (with and without topic information) for a single drug.
To test the significance of the differences between the proposed method BMull and the competing methods, we run a two-way ANOVA model for the logit-transformed performance measure (either accuracy or AUC) with two factors (method and drug). The results for accuracy and AUC are reported in Table 7. First, the coefficient estimates for the competing methods are all negative, indicating that BMull outperforms the others, regardless of the measure. In terms of accuracy, the differences are statistically significant at the 5% significance level, except for SVM_non, whose difference is significant at the 10% level. In terms of AUC, all the differences are statistically significant at the 5% level. In addition, the next best method is not the same for accuracy and AUC.
The AUC comparison for all drugs is given in Fig. 4, in which each boxplot shows the AUC distribution for one particular method. The proposed BMull generally achieves better AUC values: its boxplot shows the highest minimum, maximum, and first, second, and third quartiles among all methods, and its first quartile is higher than or comparable to the medians of the other methods, except for that of OP_non. It is also more robust to extra noninformative variables while still efficiently using the information underlying the variables from topic modeling, compared to the other methods. In particular, OP suffers a large performance drop from the without-topic version to the with-topic version.
We also evaluate the predictive performance of the different methods on each individual task (measure). The results for the first two depression-related drugs are reported in Table 8. Our method always avoids the worst performance and provides the best performance in 5 out of 12 cases. Interestingly, we also observe that the performance for some of the measures is not consistent with the overall performance. For example, our method produces the best overall accuracy for drug 1, as shown in Table 6. However, SVM produces the best accuracy for "ease of use," and our method falls behind a bit. Note that the accuracy for ease of use is quite high, more than 90% for most methods; thus, multitask learning can work by sacrificing the performance of one task a bit while boosting the overall performance.

Sensitivity to changes in sample size
In the previous subsections, we fix the sample size of the training set at 50. Here, we evaluate the performance of the methods by varying the sample size, and visualize how the performance and time complexity change as the sample size increases in Fig. 5 using resampled data for depression-related drug 4. We observe that, in general, the accuracy and AUC increase as the sample size increases, except for the accuracy of SVM, which stays virtually constant for different sample sizes. In terms of accuracy, the proposed BMull is less sensitive to the sample size change as opposed to OP_topic, NN_topic and Ada, and remains very competitive for both small and large sample sizes. In terms of AUC, BMull is clearly the best while NN and SVM_non are the worst. A further examination reveals that SVM_non predicts almost all observations to be 1 in this analysis when using 0.5 as the cutoff, and so it is not able to differentiate between good and bad ratings. This also explains why its accuracy is nearly constant and its AUC is one of the lowest among all the methods.
In the third panel of Fig. 5, we report the computation times of various methods across different sample sizes. Given that our Bayesian method involves MCMC, we anticipate its computation time to be longer than that of other methods. Specifically, for the sample sizes we tested, OP, SVM, and RF all took <1 s to run, while our method required longer. Nonetheless, our method remains computationally affordable, with the longest running time clocking in at <25 s.

Model interpretation and insights
In addition to improved predictive power, our method is able to produce interpretable results to aid our understanding. Here, we use depression-related drug 2 and the satisfaction measure as an example and report model estimation results, including posterior means and 95% credible regions, in Table 9. Unlike the frequentist LASSO (Tibshirani 1996), Bayesian variable selection methods do not set any coefficient to be exactly 0. Thus, to identify important topics, we calculate the ratio between the Bayesian estimate and the corresponding frequentist estimate from an ordered probit model for each regression parameter. We use 0.5 as a cutoff for the ratio (Carvalho et al. 2010; Castillo et al. 2015; Li and Pati 2017), i.e. topics with ratio > 0.5 are regarded as informative topics. Our Bayesian model identifies two informative topics for the satisfaction rating: topics 1 and 2 contribute negatively to the satisfaction rating, while the other topics and prior_sentiment are regarded as irrelevant. To further understand those topics and how they contribute to the rating, the top 20 words for each topic, along with a brief summary and some representative reviews, are provided in the Supplementary Material. To list a few of the findings: topic 1 has the largest (negative) coefficient, and it mainly discusses the side effects of the drug. In contrast, the coefficient for topic 3 is very close to 0. A closer examination shows that this topic covers some side effects (ringing in the ear) and some mild treatment effects (helped concentrate). In practice, if patients could provide their preferences about each treatment effect and their tolerance levels toward different kinds of side effects, our interpretable model would lead to a better understanding of the drug recommendation process and facilitate the communication between patients and doctors.
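A sketch of the ratio-based screening step, given the Bayesian posterior means and the corresponding frequentist ordered-probit estimates (both assumed already computed; taking absolute values of the ratio is our own simplifying assumption):

```python
import numpy as np

def informative_covariates(bayes_est, freq_est, names, cutoff=0.5):
    """Flag covariates whose ratio of Bayesian to frequentist coefficient
    estimates (in absolute value) exceeds the cutoff (0.5 in the article)."""
    ratio = np.abs(np.asarray(bayes_est, dtype=float)) / np.abs(np.asarray(freq_est, dtype=float))
    return [name for name, r in zip(names, ratio) if r > cutoff]
```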

Discussion
We proposed a BMul approach for drug recommendation. A salient feature of the proposed BMul is that multiple measures are modeled simultaneously. Thus, the models can borrow information from each other and utilize information more efficiently. For illustrative purposes, we use patient information and topic information extracted from patient review contents for model construction. We consider BMul with/without constant intercepts across tasks. To filter out nonrelevant information, especially when the number of potential predictors is large, we further propose BMull (i.e. BMul with LASSO). Such added variable selection functionality will be especially useful in situations when practitioners have access to extra patient information, such as that obtained from electronic medical records or patients' gene-level information.
In general, BMull performs better than a list of benchmark methods in terms of accuracy and AUC. We note that for some of the data examples considered, the improvement may not be as big as one wishes. As the field of supervised learning has matured, improvements in performance have become smaller, since state-of-the-art models have already achieved high accuracy levels on many benchmark datasets. While there is still room for improvement, it has become increasingly challenging to make significant gains in prediction performance, and more focus is shifting toward improving the interpretability, efficiency, and robustness of models, which are indeed the main advantages of our method. Recall that, besides its competitive performance, our method offers interpretable results, works efficiently for small-sized data, and is robust to possibly noninformative covariates.
Due to the nature of the data collected, besides patients' reviews, only patients' basic information such as age and gender is available, while their actual symptoms are unknown. If more detailed clinical data, such as patients' medical history and symptoms, were available, better recommendations could be made. In such situations, our method can easily incorporate them as extra features. Further, the variable selection feature of our model can tease out the irrelevant features and help make interpretable decisions. Finally, to utilize the proposed method in practice, a doctor can resort to results from our model for factors potentially affecting a patient's drug preference. That information, in combination with our predictions and the doctor's other assessments based on his/her expertise, could lead to a more informed decision.

Supplementary data
Supplementary data are available at Bioinformatics online.